PROMPI: a DiRAC RSE success story

DiRAC Science Day 2024

Miren Radia

University of Cambridge

Thursday 12 December 2024

Introduction

The team

Raphael Hirschi
Professor of Stellar Hydrodynamics and Nuclear Astrophysics
Keele University

Vishnu Varma
Research Associate In Theoretical Stellar Astrophysics
Keele University

Miren Radia
Research Software Engineer
University of Cambridge

Others

  • Federico Rizzuti, former PhD Student, Keele University
  • Caitlyn Chambers, new PhD student, Keele University

PROMPI

What does the code do?

  • PROMPI is a fluid dynamics code that is used to simulate complex hydrodynamic processes within stars.
  • Numerical methods:
    • Finite volume
    • Eulerian
    • Piecewise Parabolic Method (PPM) hydrodynamics scheme
  • Physics:
    • Fully compressible fluids
    • Nuclear burning
    • Convection/turbulence
  • Code:
    • Fortran
    • Parallelised via domain decomposition using MPI

Evolution of \(|\mathbf{v}|\) for a \(1024^3\) simulation of the Carbon-burning shell

Previous RSE work

What improvements had already been made to the code?

Over several DiRAC RSE projects, the code has been enhanced and modernised in a number of ways:

  • Acceleration on Nvidia GPUs using OpenACC
  • Fortran 77 → Modern free-form Fortran
  • Object-oriented design (Fortran 2003)
  • Legacy include statements and common blocks → Modules
  • Custom Makefile build system → CMake
  • Custom binary I/O format → HDF5
  • Regression tests and GitLab CI pipeline to run them

This project

Aims

What still needed to be done for the new code to be research-ready?

Despite the enhancements, there was still work that needed to be done before the group felt they could switch over:

  1. Consistency between the results on GPU and CPU.
  2. Optimal performance on the DiRAC systems the group uses (COSMA8 and Tursa).
  3. Porting and testing of the physics modules and initial conditions from the old version of the code that are needed to simulate specific scenarios.
  4. Improved scaling beyond a single GPU.

Work summary

What improvements were made to the code?

During the project, changes I made include:

  • Improvements and updates to the CMake build system.
  • Creation of a dependency software stack on Tursa and greenHPC (Keele's local system).
  • Refactoring, updating and adding to the test and CI frameworks.
  • Fixing and refactoring the analysis/plotting Python scripts.
  • Significant refactoring of the MPI communication.
  • Fixing the HDF5 checkpoint and restart consistency.
  • Benchmarking and scaling analysis.

Improving MPI communication

The problem

What was causing such poor performance on GPUs?

Previously the code used:

  • Nvidia managed memory extension to OpenACC:
    • The runtime automatically migrates data between host (CPU) and device (GPU) as required.
  • MPI derived datatypes:
    • MPI_Type_vector to simplify halo/ghost cell exchange, since this data is non-contiguous in memory, albeit regularly spaced.

Non-contiguous memory layout in an MPI_Type_vector
  • Effectively blocking MPI calls:
    • The ghosts for each variable were all sent in separate MPI_Isends.
    • However, MPI_Wait was called after every MPI_Irecv.

This combination meant lots of small host-device data migrations → bad for performance:

  • For a \(512^3\) test simulation running on 8 Tursa Nvidia A100s (2 nodes), > 90% of the walltime was spent in communication.
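As a rough illustration (the variable names, extents and indices here are hypothetical, not PROMPI's actual declarations), the previous pattern looked something like:

```fortran
! Hypothetical sketch of the old exchange pattern (not PROMPI's real code).
! One derived datatype describes the strided, non-contiguous halo slab.
call MPI_Type_vector(ny*nz, nghost, nx, MPI_DOUBLE_PRECISION, halo_type, ierr)
call MPI_Type_commit(halo_type, ierr)

do ivar = 1, nvars
  ! A separate small message per variable...
  call MPI_Isend(u(:, :, :, ivar), 1, halo_type, dest, ivar, comm, sreq(ivar), ierr)
  call MPI_Irecv(u(:, :, :, ivar), 1, halo_type, src,  ivar, comm, rreq, ierr)
  ! ...but an MPI_Wait immediately after each receive, serialising the
  ! exchange and triggering many small host-device migrations under
  ! managed memory.
  call MPI_Wait(rreq, stat, ierr)
end do
```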

The solution

How was it sped up?

I significantly refactored the communication in the following ways:

  • Manual packing and unpacking of data:

    • No more MPI_Type_vector.
    • Single send/receive buffer for each pair of communicating processes (containing data from all variables).
    • Uses asynchronous OpenACC kernels to [un]pack data on the GPU.
  • Forced use of GPU-aware MPI:

    !$acc host_data use_device(send_buf)
    call mpi_isend(send_buf, ...)
  • MPI_Waitall after all sends and receives for each direction.
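Putting these pieces together, the refactored exchange can be sketched as follows (names and index arithmetic are again hypothetical, and only one direction is shown):

```fortran
! Hypothetical sketch of the refactored exchange (one direction shown).
! Pack all variables' ghost zones into a single device-resident buffer
! using an asynchronous OpenACC kernel.
!$acc parallel loop collapse(2) async(1) present(u, send_buf)
do ivar = 1, nvars
  do i = 1, nghost_cells
    send_buf(i, ivar) = u(pack_index(i), ivar)
  end do
end do
!$acc wait(1)

! Pass device pointers straight to GPU-aware MPI: no host staging.
!$acc host_data use_device(send_buf, recv_buf)
call MPI_Isend(send_buf, nbuf, MPI_DOUBLE_PRECISION, dest, tag, comm, reqs(1), ierr)
call MPI_Irecv(recv_buf, nbuf, MPI_DOUBLE_PRECISION, src,  tag, comm, reqs(2), ierr)
!$acc end host_data

! A single MPI_Waitall per direction instead of a wait per message,
! followed by an analogous OpenACC kernel to unpack recv_buf.
call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
```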

After these changes with our \(512^3\) test case on 8 Tursa Nvidia A100s:

  • ~200x speed-up in communication leading to ~20x overall speed-up.
  • < 10% of the walltime spent in communication.

Benefits

Scaling

How does PROMPI perform after these improvements?

Weak scaling on Tursa

  • Excellent weak scaling of 88% efficiency up to 128 GPUs.
  • The most relevant scaling metric for the group, given its typical research workflows.

Strong scaling on Tursa and COSMA8

  • Good strong scaling (>50% efficiency) up to around 32 Tursa Nvidia A100 80GB GPUs.
  • Efficiency drops at higher GPU counts due to GPU underutilisation.
  • Grey line shows roughly how many COSMA8 (Milan) nodes are equivalent to 1 Tursa GPU.
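For reference, the efficiencies quoted above follow the standard scaling definitions (assuming these are the ones used here):

\[
E_{\text{strong}}(N) = \frac{T(1)}{N\,T(N)}, \qquad
E_{\text{weak}}(N) = \frac{T(1)}{T(N)},
\]

where \(T(N)\) is the walltime on \(N\) GPUs, with the total problem size fixed for strong scaling and the per-GPU problem size fixed for weak scaling.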

Other benefits

What else has been achieved as a result of this project?

Performance and scaling improvements were not the only outcomes of this project. Other benefits include:

  • Sustainability and maintainability improvements from:
    • Better CI infrastructure.
    • Additional tests and increased robustness.
  • HPC system support from:
    • Improvements to CMake including support for newer compilers.
    • Software dependencies built on Tursa and local HPC.
  • Capabilities:
    • Better reproducibility from HDF5 checkpoint improvements.
    • More useful and accurate visualisations from the plotting script improvements.

Any questions?